Molecular Systems Biology — Latest Matching Preprints

1

Single-cell gene programs define subtype identity and metastatic trajectories in renal cell carcinoma

Madrigal, A.; Kim, M.; Mehrjoo, Z.; Nishimura, T.; Saatci, O.; Osakwe, A.; Zavacky, E.; Moslemi, E.; Glennon, K. I.; Dankner, M.; Maritan, S. M.; Kuasne, H.; Pilon, V.; Monast, A.; Soytas, M.; Arseneault, M.; Oikonomopoulos, S.; Harutyunyan, A.; Lu, T.; Rayes, R.; Soto, L. M.; Hernandez-Corchado, A.; Spicer, J. D.; Petrecca, K.; Siegel, P.; Park, M.; Ragoussis, J.; Sahin, O.; Brimo, F.; Tanguay, S.; Riazalhosseini, Y.; Najafabadi, H. S.

2026-07-16 genetic and genomic medicine 10.64898/2026.07.14.26357682 medRxiv

Top 0.6%

3.4%

While extensive cellular heterogeneity in renal cell carcinomas (RCC) is linked to diverse clinical outcomes, our understanding of this diversity is limited to those driven by clonal patterns or activity of canonical pathways. Here, we present a compendium of over 85,000 single-cell gene expression profiles from primary and metastatic tumors as well as patient-derived models across four RCC subtypes, including the rare clear cell papillary renal cell tumors, which we show are often misclassified and for which we identify CASP14 as a highly sensitive and specific biomarker. We dissect malignant cell variation within and across tumors using a generative modeling framework that accounts for clonal and copy number-driven expression shifts, defining 59 gene expression programs that deconstruct canonical pathways into functional submodules with divergent activity patterns, distinct regulators, and differential association with clinical outcomes. Despite the canonical view that VHL-deficient clear cell RCC exists in a constitutive pseudohypoxic state, we show strong intra-tumor variability of a hypoxia inducible factor 2 (HIF2)-driven program linked to poor outcome. We also identify early, spatially organized activation of a complete epithelial-to-mesenchymal transition (EMT) program, loss of epithelial identity, and upregulation of protein translation programs as key characteristics of metastatic progression. Finally, a metastatic signature capturing cellular de-differentiation and translational activity identifies primary tumors associated with adverse clinical outcomes. Together, this resource establishes a framework for dissecting malignant cell heterogeneity, refines RCC subtype classification, and defines transcriptional programs underlying metastasis progression.

2

Analytical perturbation reveals hidden instability of biological phenotypes

Piorkowska, N. J.; Ostromecki, A.; Franik, G.; Bizon, A.

2026-07-16 endocrinology 10.64898/2026.07.13.26357916 medRxiv

Top 0.6%

3.3%

Background: Unsupervised machine learning has become a cornerstone of computational phenotyping across clinical medicine, genomics, imaging, and multi-omics research. However, phenotype discovery relies on a sequence of analytical decisions - including missing-data handling, preprocessing, dimensionality reduction, clustering methodology, and stochastic initialization - that are rarely evaluated collectively. Although clustering stability has been extensively investigated, the robustness of complete analytical workflows remains largely unexplored. Results: We developed an Analytical Perturbation Framework that systematically quantifies the robustness of phenotype discovery by perturbing complete unsupervised learning workflows rather than individual clustering algorithms. Using a real-world cohort of 1,286 women with polycystic ovary syndrome (PCOS), we generated 116 valid analytical pipelines comprising alternative preprocessing strategies, missing-data handling methods, dimensionality reduction approaches, clustering algorithms, and random initializations. Agreement between independently generated phenotype solutions was consistently low (median Adjusted Rand Index = 0.079), indicating substantial sensitivity of phenotype discovery to routine analytical decisions. Variance decomposition identified preprocessing as the largest contributor to phenotype instability (22.8%), followed by clustering methodology (14.6%), whereas stochastic initialization explained only 3.1% of the observed variability. At the patient level, most individuals exhibited reproducible phenotype assignments (median Patient Robustness Score = 0.719), although a substantial subgroup showed markedly lower assignment stability. 
Feature perturbation analyses identified follicle-stimulating hormone, anti-thyroglobulin antibodies, anti-thyroid peroxidase antibodies, total testosterone, luteinizing hormone, and androstenedione as the strongest contributors to computational robustness, rather than biological importance. Finally, phenotype solutions demonstrating greater computational robustness also exhibited greater biological coherence during independent validation.
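
The agreement metric quoted in this abstract can be illustrated with a minimal sketch of the Adjusted Rand Index and its median over all pairs of phenotype solutions. This is not the authors' code: the function names are hypothetical, and the toy label vectors in the usage note are invented for illustration only.

```python
from collections import Counter
from itertools import combinations
from math import comb
from statistics import median


def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand Index between two clusterings of the same items."""
    if len(labels_a) != len(labels_b):
        raise ValueError("label lists must have equal length")
    n = len(labels_a)
    if n < 2:
        return 1.0  # no pairs to disagree on
    # Contingency counts: items sharing each (cluster_a, cluster_b) pair.
    pair_counts = Counter(zip(labels_a, labels_b))
    sum_pairs = sum(comb(c, 2) for c in pair_counts.values())
    sum_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    sum_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = sum_a * sum_b / comb(n, 2)  # agreement expected by chance
    max_index = (sum_a + sum_b) / 2
    if max_index == expected:  # degenerate case, e.g. one cluster each
        return 1.0
    return (sum_pairs - expected) / (max_index - expected)


def median_pairwise_ari(solutions):
    """Median ARI over every unordered pair of clustering solutions."""
    return median(
        adjusted_rand_index(a, b) for a, b in combinations(solutions, 2)
    )
```

For example, `adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0])` gives 1.0 because the index ignores cluster names, while two solutions that split the same items in unrelated ways score near or below zero.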

3

From amplicon to antigen: a quantified transmission map that nominates multi-antigen antibody-drug-conjugate co-target sets across cancer types

Lam, J. M.; Walker-Samuel, S.; Pennycuick, A.

2026-07-16 oncology 10.64898/2026.07.13.26357987 medRxiv

Top 3%

0.9%

Somatic copy-number amplification is pervasive in cancer, and the genes it carries are candidate drug targets - but only those whose amplification is transmitted to accessible surface protein can be reached by an antibody-drug conjugate (ADC). We build an integrated map of copy-number-to-protein transmission across six tumour types and ask, for every amplified gene, whether its dosage reaches the surface. Copy number transmits to mRNA (median per-gene r = 0.21) but is attenuated at the protein level in 85% of genes, and the mRNA ranking is largely preserved to protein (rho = 0.70); the ranking is set principally at the chromatin/transcription step - among directly measured regulatory inputs, promoter DNA methylation and tumour chromatin accessibility each explain about an order of magnitude more of the transmission variance than gene structure, and do so complementarily. Critically, transmissibility is a stable, gene-intrinsic property: it is predictable from gene properties alone, with no proteomic input, at a leave-gene-out rank correlation of 0.52 (R2 = 0.29); it is not positional (holding out whole chromosome arms changes accuracy by 0.001); and it transfers across lineages (Kendall W = 0.97 across leave-one-lineage-out refits). This licenses a predictor that nominates surface targets in cancer types that lack a tissue-referenced proteome, combining direct protein measurement where it is available with prediction where it is not. 
Requiring co-elevation on a recurrent amplicon with measured transmissibility and an accessible extracellular ectodomain nominates 22 surface antigens on 18 distinct recurrent amplicons across four cancer types (renal, endometrial and both lung subtypes) - for example ITGB8+TSPAN13+TTYH3 on lung 7p, NCSTN+HSD17B7+MPZL1 on 1q (recurrent in several types), the transferrin receptor TFRC on squamous 3q, and FZD1 on clear-cell renal 7q; 21 of the 22 are non-driver passengers and 10 are confirmed on the experimental Cell Surface Protein Atlas. In single malignant cells, against a null that controls for per-cell sequencing depth, the co-detected constructs sit at a modest 1.05-1.45x above independence (p < 0.001, donor-block bootstrap intervals clear of 1.0), and at binding-relevant thresholds the normal-tissue co-expression collapses - so an avidity AND-gate that binds stably only where the antigens co-occur would spare normal cells that carry only one. Observed transmissibility itself transfers strongly between the two lung subtypes (rho = 0.88) and remains positive across distant lineages, consistent with the shared cell-of-origin regulation the map implies. Single-cell co-detection is demonstrated wherever a malignant single-cell atlas exists (both lung subtypes and glioblastoma - the latter entirely from prediction, using no GBM surface-abundance measurement); the remaining cohorts are nominated on the same genetic and topological evidence. The result is a pan-cancer, confidence-tiered catalogue of multi-antigen ADC co-target sets with a concrete plan to test them.
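
To make the "above independence" figure concrete, the sketch below computes a plain observed-over-expected co-detection ratio for two antigens across cells. It is a simplification under stated assumptions: the function name is hypothetical, and it uses the naive independence null rather than the depth-controlled null and donor-block bootstrap described in the abstract.

```python
def codetection_ratio(detected_a, detected_b):
    """Observed co-detection rate divided by the rate expected if the
    two antigens were detected independently (naive null)."""
    n = len(detected_a)
    if n == 0 or n != len(detected_b):
        raise ValueError("inputs must be non-empty and of equal length")
    p_a = sum(1 for a in detected_a if a) / n
    p_b = sum(1 for b in detected_b if b) / n
    p_ab = sum(1 for a, b in zip(detected_a, detected_b) if a and b) / n
    if p_a == 0 or p_b == 0:
        raise ValueError("each antigen must be detected in at least one cell")
    # A value of 1.0 means co-detection exactly matches independence.
    return p_ab / (p_a * p_b)
```

A ratio of 1.0 corresponds to independence; values such as the 1.05-1.45x range quoted above would indicate modest enrichment of co-detection.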

4

Gradient-guided adapter merging for neuroimaging vision-language models

Bit, S.; Guney, O. B.; Jia, S.; Kolachalama, V. B.

2026-07-21 health informatics 10.64898/2026.07.18.26358397 medRxiv

Top 6%

0.3%

Automated interpretation of neuroimaging studies requires simultaneous assessment of multiple imaging evidence variables, each tied to distinct anatomical structures. Vision-language models (VLMs) offer a unified framework for multi-task analysis, but adapting pre-trained VLMs remains challenging. Full fine-tuning is computationally prohibitive, and joint multi-task training requires simultaneous access to all task data, which is often infeasible in clinical settings. Although model merging enables multi-task composition without joint re-training, existing methods focus on post-hoc algorithms with limited extension to VLMs and minimal application to neuroimaging. Here, we present GRadient-guided Adapter Merging (GRAM), a layer-selective low-rank adaptation (LoRA)-based fine-tuning and merging framework for multi-task neuroimaging visual question-answering (VQA). GRAM uses a gradient ratio that contrasts class-specific gradients to identify task-discriminative layers, and applies subspace-constrained projected gradient descent to restrict LoRA updates to directions consistent with the geometry of the pre-trained model. We leveraged a structured VQA benchmark, developed from the National Alzheimer's Coordinating Center (NACC) dataset, that pairs multi-sequence brain MRI studies with question-answer pairs across clinically relevant imaging evidence variables. Experiments on the VQA benchmark showed that GRAM outperformed or matched all-layer LoRA fine-tuning and a standard merging baseline while reducing inter-task interference during merging, and approached or surpassed the performance of joint multi-task training without joint re-training.

5

How bursty infectiousness shapes epidemic dynamics

Kissler, S. M.

2026-07-17 epidemiology 10.64898/2026.07.15.26358199 medRxiv

Top 7%

0.3%

An epidemic's expected course is determined by the magnitude and timing of a typical person's infectiousness, which are captured in turn by the basic reproduction number and the generation-time distribution. These fundamental, population-average quantities can mask individual-level variation that shapes how an epidemic actually unfolds: for example, individual variation in the magnitude of infectiousness (overdispersion) creates superspreading, a key feature of the SARS-CoV-1 and SARS-CoV-2 epidemics. However, the impact of individual variation in infectiousness timing is less well understood. Here, we demonstrate that individual infectiousness timing varies substantially and to different degrees across pathogens. For some common pathogens, including influenza, measles, and SARS-CoV-2, infectiousness is "bursty", or highly concentrated and variably-timed across individuals: for example, the window of appreciable infectiousness for SARS-CoV-2 may last for roughly a day, vs. the 9-12 days usually quoted. We show that bursty infectiousness creates superspreading without inherent superspreaders, makes epidemic timing more variable, amplifies the time-sensitivity of common interventions, and complicates inference of key epidemiological parameters. Together with the reproduction number, the generation-time distribution, and overdispersion, burstiness completes a family of basic parameters that govern how epidemics unfold.

6

CuGen: A GPU-accelerated framework for large-scale genomics

Kiiskinen, T.; Richland, J.; Wang, W.; Lu, W. S.; Balasubramanian, N.; Hastie, T.; Tibshirani, R.; Rivas, M. A.

2026-07-17 genetic and genomic medicine 10.64898/2026.07.15.26358178 medRxiv

Top 8%

0.3%

Biobank-scale genomic analyses remain computationally expensive, CPU-bound workflows, particularly when adjusting for confounding. Here, we present CuGen, a GPU-accelerated framework for large-scale genomics. CuGen uses UltraLasso, a novel hierarchical application of univariate-guided sparse regression (uniLasso), to select a compact, phenotype-informed active set of fewer than 30,000 variants. This achieves robust leave-one-chromosome-out (LOCO) confounding control, enabling both downstream GWAS and in-sample fine-mapping. Additionally, we introduce the .cugen file format, a genotype representation designed for memory-optimized, high-throughput streaming and random access on GPU hardware. Building on this substrate, we provide a general GPU-accelerated genomics toolkit handling polygenic prediction, data manipulation, quality control, analysis, and visualization. We demonstrate CuGen's efficacy in the UK Biobank with up to 408,624 individuals, where the full GWAS pipeline and fine-mapping against 6.8 million imputed variants completes in approximately 10 minutes on a single high-throughput GPU with 80 GB of memory. The pipeline scales efficiently to massive phenome-wide analyses with sublinear resource consumption.

7

Cross-database validation reveals distinct layers of transportability in ICU delirium prediction

Ni, S.; Sato, K.

2026-07-21 health informatics 10.64898/2026.07.19.26358409 medRxiv

Top 8%

0.2%

External validation of clinical AI emphasizes discrimination, although deployment requires the endpoint, probability estimates and operating policy to transport. Here we show that these layers diverged in retrospective bidirectional evaluation of five model families across eICU and MIMIC-IV. Coarse-label AUROC fell from 0.87-0.92 internally to 0.66-0.83 during source-only transfer. For assessment-conditioned repeated monitoring of persistence or recurrence, external AUROC reached 0.76-0.94, but removing assessment history reduced it by 0.16-0.32; broader features did not help consistently. Transported scores concentrated future-positive ICU stays 2.4-6.9-fold in the top risk decile. Development-selected cutoffs alerted 0.3-2.0% of prediction rows and captured 9.2-11.0% of future-positive rows; after deduplication, 4.9-12.2% of stays were alerted, capturing 43.9-49.4% of future-positive stays. Thus, ranking can persist while probability and policy transport remain site dependent. Layered validation is a prerequisite for prospective evaluation, not evidence of clinical benefit.
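
The top-decile concentration reported above can be illustrated with a small sketch. The definition used here (positive rate among the top 10% of scores divided by the overall positive rate) is an assumption about how such a fold value is commonly computed, not a confirmed detail of the study, and the function name is hypothetical.

```python
def top_decile_fold(scores, labels):
    """Fold enrichment of positives in the top 10% of scores.

    Assumed definition: positive rate in the highest-scoring decile
    divided by the positive rate across all rows.
    """
    n = len(scores)
    if n == 0 or n != len(labels):
        raise ValueError("inputs must be non-empty and of equal length")
    overall_rate = sum(labels) / n
    if overall_rate == 0:
        raise ValueError("at least one positive label is required")
    k = max(1, n // 10)  # size of the top decile, at least one row
    ranked = sorted(range(n), key=lambda i: scores[i], reverse=True)
    top_rate = sum(labels[i] for i in ranked[:k]) / k
    return top_rate / overall_rate
```

A model can keep a high fold value after transfer even when its probabilities are miscalibrated, which is the distinction between ranking and probability transport that the abstract draws.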

8

Proteogenomic mapping of multimorbidity identifies C1R linking coronary artery disease and dementia

Li, L.; Tang, Z.; Zhong, Z.; Geng, T.; Guo, Y.; Liao, Y.; Demirkan, A.; Bowden, J.; Bragg, F.; Pan, A.; Sun, X.; Liu, J.; Liu, G.; Liu, J.

2026-07-16 genetic and genomic medicine 10.64898/2026.07.14.26358022 medRxiv

Top 8%

0.2%

Multimorbidity is highly prevalent in ageing populations, yet its shared molecular basis remains poorly defined, limiting the development of therapies that target multiple conditions. We systematically integrated measurements of 1,954 circulating proteins from 54,219 individuals in discovery and 35,559 in replication, focusing on ten common age-related diseases: coronary artery disease, chronic kidney disease, chronic obstructive pulmonary disease, dementia, heart failure, major depressive disorder, osteoarthritis, Parkinson's disease, stroke, and type 2 diabetes. Coronary artery disease emerged as a central condition in the multimorbidity network, sharing circulating protein signatures with seven other diseases. Through genetic causal-inference analyses, we identified 40 circulating proteins with cross-disease relevance, of which four were further supported by colocalization of genetic variant associations. Among these, complement C1r, encoded by C1R, emerged as a key link between coronary artery disease and dementia, supported by independent colocalization evidence (PP.H4 = 0.86). Phenome-wide association analyses of C1R variants suggested that this signal was not driven by widespread unrelated genetic effects, but instead may reflect a more specific contribution to coronary artery disease-dementia pathogenesis. In vitro experiments further suggested that fibroblast-derived C1R promotes endothelial inflammation and neuronal apoptosis, providing mechanistic plausibility. Together, these findings position C1R as a biologically plausible and therapeutically relevant molecular link between coronary artery disease and dementia.

9

Multimodal gene prioritization reveals nonlinear regulatory architecture in childhood-onset asthma

Huang, N.; Ragsac, M. F.; Gui, X.; Tantisira, K. G.; Amariuta, T.

2026-07-16 genetic and genomic medicine 10.64898/2026.07.14.26357983 medRxiv

Top 8%

0.2%

Asthma is a heritable complex disease that disproportionately burdens minority and admixed populations in the US. However, the causal genes and regulatory mechanisms governing inherited risk remain largely unresolved. We performed a European-ancestry meta-analysis of 141,894 cases and 1,361,846 controls drawn from the Trans-national Asthma Genetic Consortium (TAGC) and Global Biobank Meta-analysis Initiative (GBMI), yielding an estimated h2SNP of 0.056 (SE = 0.0038) and 275 independently associated loci. To enhance mechanistic inference beyond variant-level associations, we developed a multimodal framework to predict asthma risk integrating GWAS summary statistics, bulk tissue expression quantitative trait loci (eQTL) data from the Genotype-Tissue Expression (GTEx) project, and single-cell gene eQTL data from the OneK1K Project. We performed transcriptome-wide association studies (TWAS) and subsequently applied probabilistic fine-mapping with FOCUS to prioritize putative causal genes expressed in bulk tissues and higher resolution immune cell populations. Fine-mapping asthma-associated genes implicated barrier-immune and metabolic-endocrine tissues alongside adaptive T-cell subsets as the primary mediators of asthma genetic risk, resolving canonical CD4+ Th2 effector genes including IL1RL1, TSLP, STAT6, and GATA3. Using these prioritized genes, we constructed a polygenic transcriptome risk score (PTRS) using random forest to integrate gene-level effects across critical tissues and cell types. Evaluated in two ancestrally distinct pediatric asthma cohorts, the Childhood Asthma Management Program (CAMP) and the Genetics of Asthma in Costa Rica Study (GACRS), our PTRS demonstrated improved transferability over the standard variant-level and gene-level baseline models. 
While modest common variant heritability limits the discriminative power of our models, we estimated a theoretical maximum achievable area under the receiver operating characteristic (AUROC) curve of 0.64. Our integrative nonlinear model of PRS-CSx and cross-modal (bulk tissue and single cell) FOCUS PTRS resulted in the best cross-cohort performance (CAMP AUC = 0.632, sd = 0.04, 3.55 case/control odds ratio in top vs. bottom quartiles), representing an increase of +0.118 AUC over PRS-CSx, +0.067 AUC over tissue-specific TWAS pruning and thresholding, and +0.041 AUC over cell-type-specific FOCUS PTRS. Our results demonstrate that modeling nonlinear interactions between variant- and gene-level effects across both bulk tissue and single cell eQTL data improves our ability to determine high-risk individuals and to explain the likely mechanisms driving genetic susceptibility of childhood-onset asthma.

10

Exploration of the molecular origins of sex-specific and temporal comorbidity patterns in dementia: insights from the Austrian claims data

Kovacevic, V.; Basaragin, B.; Kovacevic, J.; Zecevic, A.; Danilo Lombardo, S.; Dervic, E.

2026-07-16 genetic and genomic medicine 10.64898/2026.07.14.26357961 medRxiv

Top 9%

0.2%

Dementia is a progressive condition that impairs cognitive processes such as memory, decision making, and the ability to manage daily activities. Recent estimates suggest that more than half of all dementia cases could be preventable by addressing their risk factors, including disease comorbidities such as diabetes and vision loss. Yet, we lack a comprehensive molecular map of dementia comorbidities. In this work, we analyzed Austrian nationwide hospital claims data, comprising 13 million hospital stays from 2015 to 2019, to systematically assess dementia-related risk across disease comorbidity patterns, covering both their molecular relationships and their epidemiological overrepresentation. We identified disease trajectories occurring before and at the time of dementia diagnosis, revealing both sex-specific and shared comorbidity patterns. Overall, we identified 51 potential risk factors, with a prominent contribution from endocrine and metabolic disorders. While Parkinson's disease emerged as a strong molecularly related driver of dementia, we also identified emerging and previously undercharacterized risk factors, including vitamin D deficiency. This integrative framework provides a comprehensive view of dementia-associated disease networks and identifies novel, potentially modifiable risk factors. These results offer new opportunities for targeted prevention strategies and advance our understanding of the complex interplay between comorbidities and dementia development.

11

Early identification of suboptimal responders to metformin in type 2 diabetes using long-term real-world HbA1c trajectories

Yang, E.; Riselli, A.; Xu, F.; Sridhar, S. B.; Kvale, M.; Giacomini, K. M.; Hedderson, M. M.; Yee, S. W.; Savic, R. M.

2026-07-20 endocrinology 10.64898/2026.07.17.26357984 medRxiv

Top 10%

0.2%

Aims: Metformin remains the primary treatment for type 2 diabetes, yet over 40% of patients fail to maintain glycaemic control. We aimed to identify patients unlikely to respond to metformin prior to treatment initiation and to evaluate whether on-treatment management can improve glycaemic outcomes in suboptimal responders, informing early treatment decisions. Materials and Methods: We analyzed 59,881 longitudinal HbA1c measurements from 7,105 patients with type 2 diabetes receiving metformin monotherapy using real-world electronic health records from Kaiser Permanente Northern California with up to six years of follow-up. We integrated demographic, clinical, genetic, and pharmacological factors to characterize metformin responder phenotypes and quantify the impact of adherence and weight control on time to glycaemic failure. Results: Three distinct trajectory-based phenotypes were identified: good (63.6%), poor (8.9%), and non-responders (27.5%). Poor responders initially achieved glycaemic targets but lost control within 2.5 years, while non-responders showed minimal HbA1c reduction and failed within 1 year. Five baseline factors (HbA1c, age at diagnosis, body mass index, sex, and estimated glomerular filtration rate) classified phenotypes with good discrimination (area under the receiver operating characteristic curve = 0.84). Incorporating on-treatment HbA1c further enhanced identification of non-responders. Among suboptimal responders, weight control and improved adherence delayed glycaemic failure by approximately 7 months; however, eventual glycaemic failure remained likely. Conclusions: We characterized three clinically relevant metformin responder phenotypes and showed that suboptimal responders can be identified early using baseline features. Poor and non-responders are unlikely to achieve durable glycaemic control with metformin alone and may require alternative treatment strategies.

12

Muscle proteins in plasma associate to distinguished phenotypes in amyotrophic lateral sclerosis

Azizi, L.; Aksoylu, I.; Bueno Alvez, M.; Foucher, J.; Juto, A.; Seitz, C.; Press, R.; Samuelsson, K.; Kläppe, U.; Uhlen, M.; Edfors, F.; Bergström, S.; Fang, F.; Nilsson, P.; Öijerstedt, L.; Manberg, A.; Ingre, C.

2026-07-16 neurology 10.64898/2026.07.14.26357727 medRxiv

Top 10%

0.2%

Background: Amyotrophic lateral sclerosis (ALS) is a neurodegenerative disease characterized by death of upper and lower motor neurons that usually presents with clinical heterogeneity. Fluid biomarker development remains dominated by neurofilament light chain (NEFL), a marker of neuroaxonal injury. NEFL is, however, not specific to ALS or its phenotypes, and there is currently a lack of biomarkers that capture ALS heterogeneity such as onset site and ALS-frontotemporal spectrum disorder (ALS-FTSD). Therefore, we investigated whether plasma proteomics could reveal pathway-level signatures that stratify and explain ALS heterogeneity. Methods: We profiled ~5,400 plasma proteins (Olink Explore HT) in 299 patients with ALS and 50 age- and sex-comparable healthy controls. We used two complementary analytic frameworks: (i) differential protein abundance analysis to identify altered proteins in ALS and across clinical subgroups, and (ii) weighted gene correlation network analysis (WGCNA) to identify coordinated protein modules and relate them to ALS diagnosis and to ALS-specific clinical traits (site of onset, ALS-FTSD, ALS functional rating scale-revised (ALSFRS-R) score, and plasma NEFL). Results: Differential abundance analysis identified 56 proteins altered in ALS versus controls, of which 40 were increased. WGCNA identified 11 co-expression modules, with ALS samples having the strongest correlation to a protein module (n=51) highly enriched for muscle-related proteins. Out of the 40 proteins that had increased expression levels, 29 overlapped with the muscle-enriched protein module, indicating that muscle-related proteins are the dominant circulating proteomic signature in ALS. This signal extended to clinical stratification: spinal-onset patients showed a strong positive association with the muscle module. 
Further, differential abundance analysis of spinal- versus bulbar-onset ALS identified changes that mapped predominantly to the same module, supporting a molecular signature of onset phenotype. In contrast, cognitive status (ALS-FTSD) mapped to distinct modules enriched for extracellular matrix/cell-adhesion pathways, consistent with a separable biological axis of disease heterogeneity. Although multiple modules correlated with NEFL, trait-specific signatures were not fully explained by neuroaxonal injury. Notably, the muscle-enriched module increased with higher NEFL and lower ALSFRS-R, supporting its interpretation as a severity-linked, muscle-involvement proxy. Conclusions: Large-scale plasma proteomics reveals that heterogeneity in ALS reflects underlying biological structures. We identified a dominant muscle-associated protein network that distinguished ALS patients from controls and correlated with disease onset phenotype and severity, alongside distinct protein networks linked to ALS-FTSD. By integrating differential protein abundance with network-based analysis, we defined pathway-level biomarker signatures that extend beyond NEFL, enabling biologically informed patient stratification and improved therapeutic monitoring.

13

Nocturnal cough as a syndromic surveillance signal for respiratory illness in England

Irons, T.; Carlsson, E.; Tang, M. L.; Mellor, J.; Rubin, C.; Allen, A.; Elliot, A. J.; Kageback, M.; Packham, J.

2026-07-21 epidemiology 10.64898/2026.07.20.26357937 medRxiv

Top 11%

0.1%

We evaluated aggregated, privacy-preserving smartphone-detected nocturnal cough activity from the Sleep Cycle application as a potential syndromic surveillance signal in England. Weekly cough metrics from January 2023 to January 2026 were compared with UK Health Security Agency indicators: NHS 111 acute respiratory infection (ARI) triage calls, influenza and COVID-19 PCR positivity, and hospital admission rates for influenza, COVID-19, and respiratory syncytial virus. We evaluated total cough counts alongside two population-normalised metrics, coughs per user and coughs per hour of sleep, and assessed temporal relationships nationally and regionally using cross-correlation with prewhitening. The strongest and most consistent associations were observed for NHS 111 ARI triage calls, where population-normalised cough metrics showed raw national correlations of approximately 0.95 and retained prewhitened correlations above 0.55 at lag 0. This indicates that nocturnal cough activity closely tracks short-term variation in an established syndromic surveillance indicator, beyond shared seasonality, long-term trends, and autocorrelation. Similar near-contemporaneous patterns were observed across regions. Population-normalised cough metrics also showed epidemiologically plausible leading associations with pathogen-specific indicators: coughs per hour of sleep peaked one week before influenza PCR positivity, while both coughs per user and coughs per hour of sleep peaked one week before COVID-19 PCR positivity. Hospital-based indicators showed weaker and more heterogeneous relationships, but the normalised cough metrics still showed plausible temporal alignment with influenza and COVID-19 admissions, including contemporaneous associations with influenza admissions and short leading associations with COVID-19 admissions. 
In contrast, unnormalised total cough counts produced less stable and often non-interpretable lag structures, consistent with sensitivity to variation in observation volume. These findings suggest that passive, near-real-time nocturnal cough monitoring can provide a population-level signal of respiratory symptom burden, with greatest utility as a broad syndromic indicator that complements surveillance sources affected by healthcare-seeking behaviour, laboratory turnaround times, backfilling, and reporting delays.
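
Cross-correlation with prewhitening, as named in this abstract, can be sketched as follows. The version below assumes an AR(1) filter fitted to the input series and applied to both series before correlating; the abstract does not state which time-series model the authors used, so the model choice and all function names are illustrative assumptions.

```python
from statistics import mean


def ar1_coefficient(x):
    """Lag-1 autocorrelation of x, used as the AR(1) coefficient."""
    m = mean(x)
    num = sum((x[t] - m) * (x[t - 1] - m) for t in range(1, len(x)))
    den = sum((v - m) ** 2 for v in x)
    return num / den


def ar1_filter(series, phi):
    """Residuals of the centred series after removing AR(1) structure."""
    m = mean(series)
    return [(series[t] - m) - phi * (series[t - 1] - m)
            for t in range(1, len(series))]


def cross_correlation(x, y, lag):
    """Correlation of x[t] with y[t + lag]; positive lag: x leads y."""
    if lag >= 0:
        xs, ys = x[:len(x) - lag], y[lag:]
    else:
        xs, ys = x[-lag:], y[:len(y) + lag]
    mx, my = mean(xs), mean(ys)
    cov = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    vx = sum((a - mx) ** 2 for a in xs)
    vy = sum((b - my) ** 2 for b in ys)
    if vx == 0 or vy == 0:
        raise ValueError("series must vary within the overlapping window")
    return cov / (vx * vy) ** 0.5


def prewhitened_ccf(x, y, max_lag):
    """Filter both series with the AR(1) model of x, then correlate."""
    phi = ar1_coefficient(x)
    xw, yw = ar1_filter(x, phi), ar1_filter(y, phi)
    return {lag: cross_correlation(xw, yw, lag)
            for lag in range(-max_lag, max_lag + 1)}
```

With weekly data, a peak at lag 1 in `prewhitened_ccf(cough, positivity, 4)` would correspond to the cough series leading the indicator by one week, in the sense used above. The variable names `cough` and `positivity` are placeholders.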

14

Multi-model forecasting of respiratory disease activity in Germany during the 2024-2025 season

Bracher, J.; Wolffram, D.; Amaral Lind, R.; Bardeck, N.; Boehm, M.; Contreras, S.; Doenges, P.; Guenther, F.; Kaiser, R.; van de Kassteele, J.; Kuhlmann, A.; Lange, B.; Nemcova, B.; Priesemann, V.; Reinacher, U.; Rodiah, I.; Sandmann, F.; the RESPINOW Study Group; Schienle, M.

2026-07-21 epidemiology 10.64898/2026.07.20.26358471 medRxiv

Top 12%

0.1%

Respiratory diseases cause considerable morbidity in autumn and winter and are a priority in public health monitoring. In Germany, they are subject to a number of surveillance systems, including both pathogen-specific and syndromic indicators. In this paper we present a collaborative multi-target and multi-model real-time forecasting system rolled out during the 2024/25 season, and discuss differences to earlier efforts carried out during the COVID-19 pandemic. A total of nine models were run to generate forecasts of general practitioner consultations for acute respiratory infections (ARI), hospitalizations for severe acute respiratory infections (SARI) and confirmed cases of seasonal influenza and RSV. As all indicators were subject to retrospective revisions, forecasting models were combined with a nowcasting step. Whenever multiple models were available for the same indicator, we combined them into an ensemble. Nowcasts showed convincing performance, even though for some models Christmas break effects led to an upward bias in early January. Forecasts were overall well-calibrated and most models outperformed simple benchmark models. These improvements were generally more substantial for age-stratified than pooled targets, and concentrated at lead times of two to three weeks. Anticipating the peak timing and magnitude proved to be challenging, with many models predicting too flat curves with a too early turnaround (e.g. already in late January rather than mid-February for SARI). The combined ensemble forecast was among the best-performing approaches, but unlike in previous related projects did not consistently outperform individual models. We conclude by discussing learnings on the organization of collaborative forecasting projects in post-COVID-19 times and the potential of AI-supported modelling.

15

Privacy-Preserving Matching for Federated Causal Inference in Multicentre Patient Cohorts

Gusinow, R.; Morgan, A. S.; Canziani, L. M.; Zeitlin, J.; Kim, M.; Gentilotti, E.; Ghosn, J.; Florence, A.-M.; Tami, A.; Toschi, A.; Palacios-Baena, Z. R.; Tacconelli, E.; Hasenauer, J.

2026-07-19 epidemiology 10.64898/2026.07.16.26358171 medRxiv

Top 12%

0.1%

Causal effect estimates are often biased in clinical and epidemiological studies because patient cohorts frequently exhibit substantial covariate imbalances between treated and control groups, an imbalance often amplified in multicentre studies by heterogeneous recruitment, clinical practice, and case mix. Covariate balancing methods are therefore essential for valid causal inference. However, their application becomes challenging when data are distributed across cohorts and cannot be pooled because of privacy, legal, or institutional constraints, leaving a gap in practical methods for causal effect estimation in federated and imbalanced clinical data settings. We develop a privacy-preserving framework for covariate balancing and causal effect estimation across distributed data providers, combining federated aggregation with differential privacy to enable propensity score subclassification and matching without sharing individual-level records. Matching relies on non-disclosive quantities and differentially private distance evaluation, and the resulting matched subsets remain local to each server. Balance can be assessed through federated diagnostics and privacy-preserving visualisations, and we provide secure estimators for average treatment effects with associated uncertainty quantification. We implement this framework in the DataSHIELD federated analysis platform via two R packages. In simulations, we demonstrate agreement between federated and centralised analyses in the absence of privacy noise and quantify the bias-variance trade-offs induced by differential privacy. We illustrate applicability in two multinational settings, a Long COVID cohort and very preterm birth cohorts, showing that the approach enables practical causal analyses under real-world data protection constraints. The DataSHIELD packages are available on GitHub. Additional methodological details are provided in the Supplementary Material.
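For readers unfamiliar with propensity score subclassification, the sketch below shows the plain, centralised version of the idea: sort units by propensity score, cut them into strata, and average the within-stratum treated-minus-control differences weighted by stratum size. It is not the authors' DataSHIELD implementation and includes no federation or differential privacy; the function name `subclassification_ate` and the toy data are hypothetical.

```python
from statistics import mean

def subclassification_ate(propensity, treated, outcome, n_strata=5):
    """Average treatment effect via propensity-score subclassification.

    Assumption: all data are pooled in one place (the centralised
    baseline), with propensity scores already estimated.
    """
    # Rank units by propensity score and cut into equal-sized strata.
    order = sorted(range(len(propensity)), key=lambda i: propensity[i])
    n = len(order)
    total_weight = 0
    weighted_effect = 0.0
    for s in range(n_strata):
        idx = order[s * n // n_strata:(s + 1) * n // n_strata]
        y1 = [outcome[i] for i in idx if treated[i]]
        y0 = [outcome[i] for i in idx if not treated[i]]
        if not y1 or not y0:
            continue  # stratum lacks one arm, so it contributes nothing
        weighted_effect += len(idx) * (mean(y1) - mean(y0))
        total_weight += len(idx)
    if total_weight == 0:
        raise ValueError("no stratum contains both treated and control units")
    return weighted_effect / total_weight

# Toy data: outcome rises with the propensity score (confounding)
# and treatment adds exactly 2 units.
propensity = [0.1, 0.1, 0.2, 0.2, 0.8, 0.8, 0.9, 0.9]
treated = [0, 1, 0, 1, 0, 1, 0, 1]
outcome = [1, 3, 2, 4, 8, 10, 9, 11]
ate = subclassification_ate(propensity, treated, outcome, n_strata=2)
```

In a federated setting, the per-stratum means and counts are the kind of aggregate that could be shared in place of individual records, which is where privacy noise would enter.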

16

Longitudinal multiomic network rewiring at the complement-coagulation interface in post-acute sequelae of COVID-19 (PASC)

Ward, B.; Belkhir, L.; Balligand, J.-L.; Cani, P. D.; De Greef, J.; Dewulf, J. P.; Gatto, L.; Haufroid, V.; Kabamba, B.; Vertommen, D.; Yombi, J. C.; Elens, L.; Bommer, G.; Bamps, L.

2026-07-16 infectious diseases 10.64898/2026.07.14.26358048 medRxiv

Top 13%

0.1%

Background. Post-acute sequelae of COVID-19 (PASC) is clinically heterogeneous and mechanistically unresolved, and single-analyte studies have struggled to explain it. Methods. We profiled matched plasma proteomics, metabolomics and whole-blood transcriptomics at acute infection and convalescence (mean 86 days later) in a Belgian cohort, using linear mixed models, multiomic gene-set enrichment, and a degree-matched differential-correlation approach to quantify how each node's interactions were rewired between patients who developed PASC and those who recovered; seven axis proteins were additionally quantified by multiplex immunoassay as orthogonal validation. Findings. Single-omic testing yielded few FDR-significant features, yet multiomic enrichment showed sustained complement cascade involvement from acute illness to follow-up in PASC. Correlation networks re-organised topologically toward C3 and lost the immunoglobulin V gene co-expression seen in recovery. The most rewired nodes, heparin cofactor II (SERPIND1), alpha-1 antitrypsin (SERPINA1), complement factor H-related 5 (CFHR5), prothrombin/thrombin (F2) and immunoglobulin V gene transcripts (notably IGLV3-21), changed in their co-expression structure rather than in abundance. In multiplex validation, acute CRP was elevated in patients who developed PASC (FDR = 0.012), whereas the directly measured abundances of the network-nominated proteins were unchanged. Interpretation. These trajectory-aware, cross-omic networks nominate a thrombo-inflammatory axis in which complement and coagulation regulation remain dysregulated in PASC at the level of wiring rather than abundance, providing a systems framework for validation and for exploring interventions at the complement-coagulation-platelet interface.

17

Discordant associations of IGF-binding proteins 1 & 2 with diabetes and cardiovascular disease: insights from UK Biobank

Rolfe-Hammerton, E. R.; Conning-Rowland, M. S.; De Faveri, L. E.; Simmons, K. J.; Meakin, P. J.; Cubbon, R. M.; Wheatcroft, S. B.

2026-07-20 endocrinology 10.64898/2026.07.17.26358347 medRxiv

Top 13%

0.1%

The insulin-like growth factor (IGF)/IGF-binding protein (IGFBP) axis has been implicated in diabetes mellitus and the associated burden of cardiovascular complications. Higher circulating levels of IGFBP-1 and IGFBP-2 have been established as markers of protection from incident type 2 diabetes, yet their associations with cardiovascular disease remain unclear. Utilising the UK Biobank (UKB) resource to integrate disease outcomes, plasma proteomics and MRI data, we examined associations of IGFBP-1 and IGFBP-2 with incident diabetes and cardiovascular disease. Approximately 50,000 UKB participants with plasma proteomic measurements for IGFBP-1 and IGFBP-2 were included. Multivariate Cox regression models revealed that participants in the highest quartiles of IGFBP-1 and IGFBP-2 had a substantially lower risk of incident diabetes (hazard ratio (HR) = 0.31 and 0.32 respectively), but, paradoxically, had increased risks of incident macrovascular disease, all-cause and cardiovascular-related mortality (HR = 1.81 and 2.39). Both proteins were negatively associated with HbA1c levels, triglyceride/HDL ratio and abdominal adiposity, yet positively associated with NT-proBNP, troponin I, cardiac chamber size and aortic dimensions. In summary, negative associations of IGFBP-1 and IGFBP-2 with incident diabetes mellitus did not translate to a reduced cardiovascular risk, suggesting potentially complex actions of IGFBP-1 and IGFBP-2 in the pathophysiology of cardiometabolic disease.

18

FoodScribe: an open-source semantic framework for nutrient estimation from free-text dietary records

Gouda, H.; Sala Climent, M.; Agongo, J.; Gaikwad, S. P.; Nattakom, A.; Zhao, H. N.; Xing, S.; Boland, B. S.; Holt, T.; Guma, M.; Dorrestein, P. C.

2026-07-17 nutrition 10.64898/2026.07.15.26358181 medRxiv

Top 13%

0.1%

Efficiently summarizing dietary records at scale remains a persistent bottleneck in nutritional epidemiology. We present FoodScribe, which translates free-text meal descriptions into quantitative nutrient profiles by combining ingredient parsing with nutrient retrieval from the USDA FoodData Central (FDC) database. Benchmarked with three LLM providers on the Nutribench dataset, FoodScribe completed annotation of 3,807 meal descriptions in 2.5 hours, a task otherwise requiring substantial manual effort from trained nutritionists. FoodScribe achieved F1 scores of 0.79-0.89 across macronutrient estimation, with models performing better for protein than for fat estimation. Application to a Mediterranean diet intervention cohort indicated dietary shifts consistent with the intervention pattern based on model-derived estimates. Integration with metabolomics data suggested that fiber and vegetable intake were positively associated with a fecal metabolite cluster.

19

A ReAct Agentic AI System for Natural Language Querying and Statistical Analysis of The Cancer Genome Atlas Clinical Data

Korutla, R.; Amal, S.

2026-07-17 health informatics 10.64898/2026.07.15.26358188 medRxiv

Top 14%

0.1%

The Cancer Genome Atlas (TCGA) holds clinical data for over 11,000 patients across 33 cancer types, but access is hard because of complex file structures, heterogeneous formats, and the need for programming. We present an agentic system for natural language querying and statistical analysis of TCGA clinical data. The system uses a large language model as an autonomous ReAct agent that selects from eight computational tools, including data extraction, descriptive statistics, Kaplan-Meier survival analysis with log-rank tests, hypothesis testing, and verification against the curated TCGA Pan-Cancer Clinical Data Resource (CDR). The agent reasons about intermediate results, adapts its approach, and returns clinically contextualized responses with source attribution and auditable traces. We introduce TCGA-Agent-Bench, 440 queries across five difficulty tiers with ground truth from the independently curated TCGA-CDR, evaluated with dual metrics of numerical accuracy and clinical completeness. The system achieves 93.4% overall accuracy (100% single-patient lookups, 99.1% cohort statistics, 92.8% comparative analyses), outperforming a fixed rule-based pipeline (87.1%), a single-pass LLM (81.8%), and retrieval-augmented generation (66.9% on a subset). Most of the benchmark is answerable from the CDR alone, so we locate the extraction layer's value in fields the CDR lacks (drug treatments, TNM components, biomarkers, biospecimen metadata): on 26 queries targeting these, the full system answers 100% versus 3.8% for CDR-only. Ablations show the reasoning loop is most impactful (+9.1% accuracy, +22.0 completeness points). A tool-based agentic architecture enables accurate, auditable analysis of clinical repositories, with value driven by tool design and recovered fields rather than model scale.
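Kaplan-Meier survival analysis is one of the tools the abstract lists. As a minimal sketch of what such a tool computes, and not the system's actual code, the product-limit estimator multiplies, at each observed event time, the fraction of at-risk patients who survive it. The function name `kaplan_meier` and the toy follow-up data are hypothetical.

```python
def kaplan_meier(times, events):
    """Return [(t, S(t))] at each distinct time with an observed event.

    times  : follow-up durations
    events : 1 if the event (e.g. death) was observed, 0 if censored
    """
    at_risk = len(times)
    survival = 1.0
    curve = []
    for t in sorted(set(times)):
        deaths = sum(1 for ti, e in zip(times, events) if ti == t and e)
        removed = sum(1 for ti in times if ti == t)  # events plus censored
        if deaths:
            # Product-limit step: surviving share of those still at risk.
            survival *= 1 - deaths / at_risk
            curve.append((t, survival))
        at_risk -= removed
    return curve

# Toy cohort of five patients; 0 in `events` marks a censored follow-up.
times = [1, 2, 2, 3, 4]
events = [1, 1, 0, 1, 0]
curve = kaplan_meier(times, events)
```

Censored patients leave the risk set without lowering the curve, which is what distinguishes this estimator from a simple fraction of patients still alive.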

20

LocusBlend: Flexible multi-index regional visualization of genomic association signals

Yang, C.; Cook, N.; Zeng, Y.; Fu, T.; Budde, J.; Cruchaga, C.; Belloy, M. E.

2026-07-21 genetic and genomic medicine 10.64898/2026.07.15.26358129 medRxiv

Top 15%

0.1%

Summary: It has become standard practice to visualize regional signals from genome-wide association studies (GWAS) using LocusZoom plots. Similarly, GWAS signals are compared to regionally matched quantitative trait loci (QTLs), i.e., variant-to-gene regulation data, using LocusCompare plots to aid assessment of candidate trait-related genes. Despite broad usage, these tools annotate variants by linkage disequilibrium (LD) to a single lead or index variant. This single-index representation has limitations for visualizing complex loci that contain multiple independent signals. We present LocusBlend, an interactive web application for multi-index, LD-blended visualization of genomic loci. LocusBlend supports one or two genomic association summary-statistic datasets and one to three index variants, multi-index LocusZoom color-blended plots, and matching LocusCompare visualizations. Applications to Alzheimer's disease GWAS and QTL signals illustrate that LocusBlend enables visualization and separation of independent signals despite shared LD and high genomic complexity. Overall, LocusBlend is aimed at helping researchers handle the continuously expanding complexity of human genomics findings. Availability and Implementation: LocusBlend is freely available at https://locusblend.wustl.edu. Publication-ready plots are generated in 1 min. Source code, documentation, example datasets, input templates, and reproducibility instructions are available at https://github.com/BelloyLab/LocusBlend. LocusBlend is implemented in Python using Streamlit, Plotly, and PLINK. Supplementary Information: Supplementary data are available online.